A Wikipedia-Based Multilingual Retrieval Model

Authors

  • Martin Potthast
  • Benno Stein
  • Maik Anderka
Abstract

This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector d for d, where each dimension i of the vector quantifies the similarity of d with respect to a document d_i chosen from the "L-subset" of Wikipedia. Likewise, for a second document d′ written in language L′, L ≠ L′, we construct a concept vector d′, using from the L′-subset of Wikipedia the topic-aligned counterparts d′*_i of our previously chosen documents. Since the two concept vectors are collection-relative representations of d and d′, they are language-independent; that is, their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
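
The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two tiny index collections are hypothetical stand-ins for topic-aligned Wikipedia articles, the function names are invented for this example, and plain term-frequency vectors are assumed as the monolingual document representation.

```python
import math
from collections import Counter

# Hypothetical topic-aligned index collections: entry i in INDEX_EN and
# entry i in INDEX_DE stand in for the same Wikipedia topic.
INDEX_EN = [
    "dog animal pet bark",
    "car engine road wheel",
    "music song guitar melody",
]
INDEX_DE = [
    "hund tier haustier bellen",
    "auto motor straße rad",
    "musik lied gitarre melodie",
]


def tf_vector(text):
    """Bag-of-words term frequencies (an assumed, simplified weighting)."""
    return Counter(text.lower().split())


def sparse_cosine(u, v):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_vector(doc, index_docs):
    """Dimension i is the similarity of doc to the i-th index document."""
    d = tf_vector(doc)
    return [sparse_cosine(d, tf_vector(index_doc)) for index_doc in index_docs]


def dense_cosine(x, y):
    """Cosine similarity of two equal-length concept vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def cl_esa_similarity(doc_en, doc_de):
    """Compare documents in different languages via their concept vectors."""
    return dense_cosine(
        concept_vector(doc_en, INDEX_EN),
        concept_vector(doc_de, INDEX_DE),
    )
```

Because each document is only ever compared with index documents in its own language, no translation step is needed. In this toy setup, cl_esa_similarity("the dog is a loyal pet", "der hund ist ein treues haustier") scores higher than the same English query paired with "das auto hat einen starken motor".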

Similar resources

I2R At ImageCLEF Wikipedia Retrieval 2010

We report on our approaches and methods for the ImageCLEF 2010 Wikipedia image retrieval task. A distinctive feature of this year's image collection is that images are associated with unstructured and noisy textual annotations in three languages: English, French and German. Hence, besides following conventional text-based and multimodal approaches, we also focus some effort into investigating mu...

Dbnary: Wiktionary as a LMF based Multilingual RDF network

Contributive resources, such as Wikipedia, have proved to be valuable in Natural Language Processing or Multilingual Information Retrieval applications. This article focusses on Wiktionary, the dictionary part of the collaborative resources sponsored by the Wikimedia

Dbnary: Wiktionary as a Lemon Based RDF Multilingual Lexical Resource

Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our effort to extract multilingual lexical data from Wiktionary data and to provide it to the community as...

UAIC's Participation at Wikipedia Retrieval @ ImageCLEF 2011

This paper describes the participation of UAIC team at the ImageCLEF 2011 competition, Wikipedia Retrieval task. The aim of the task was to investigate retrieval approaches in the context of a large and heterogeneous collection of images and their noisy text annotations. We submitted a total of six runs, focusing our effort along the textual retrieval, query expansion on English language, combi...

DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF

Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our extraction of multilingual lexical data from Wiktionary data and to provide it to the community as a M...

Publication date: 2008